[multi-gpu] Phase 1: namespace channel_type, add cross-rank attrs, doc plan #1576

Merged: erwei-xilinx merged 6 commits into Xilinx:main from erwei-xilinx:multigpu-phase1-channel-types-and-cross-rank on May 6, 2026

@erwei-xilinx erwei-xilinx commented May 3, 2026

First step toward multi-GPU messaging support. Pure IR/dialect changes; no lowering yet (Phases 2–7 land separately as #1577–#1582).

Summary

channel_type namespace rename (Option 1)

Existing values gain a npu_ prefix to make backend scope explicit:

| Before | After |
| --- | --- |
| dma_stream (default) | npu_dma_stream |
| dma_packet | npu_dma_packet |
| cascade | npu_cascade |
| mmio | npu_mmio |

Mechanical rename across 33 files (verifier, transform/conversion passes, all .mlir tests, Python programming examples).
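
For out-of-tree code that still uses the old names, the rename can be sketched as a small text rewrite. This is an illustrative helper only, not a script from this PR: `migrate_channel_types` and the regular expression are assumptions, and the only facts taken from the PR are the four old-to-new pairs.

```python
import re

# Old -> new channel_type values, copied from the rename table in this PR.
CHANNEL_TYPE_RENAMES = {
    "dma_stream": "npu_dma_stream",
    "dma_packet": "npu_dma_packet",
    "cascade": "npu_cascade",
    "mmio": "npu_mmio",
}

# Matches channel_type = "<old value>" with optional spaces around '='.
# The opening quote must directly precede the old value, so strings that
# already carry the npu_ prefix are left alone.
_PATTERN = re.compile(
    r'(channel_type\s*=\s*")(dma_stream|dma_packet|cascade|mmio)(")'
)


def migrate_channel_types(text: str) -> str:
    """Hypothetical helper: rewrite pre-rename channel_type strings in text."""
    return _PATTERN.sub(
        lambda m: m.group(1) + CHANNEL_TYPE_RENAMES[m.group(2)] + m.group(3),
        text,
    )
```

Running the helper twice gives the same result as running it once, which makes it safe to apply to files that were already partly migrated.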

New GPU multi-rank channel type

  • gpu_symmetric_heap: cross-rank channel through the symmetric heap runtime (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires put/get sites to be inside an air.rank scope.
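
The scope rule for gpu_symmetric_heap can be modelled as a walk up the parent chain. The sketch below is a toy Python model, not the C++ verifier in AIRDialect.cpp: the `Op` class, both function names, and the diagnostic wording are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """Toy stand-in for an MLIR operation: a name plus a parent link."""
    name: str
    parent: Optional["Op"] = None


def has_enclosing(op: Op, ancestor_name: str) -> bool:
    """Return True if any ancestor of op has the given operation name."""
    cur = op.parent
    while cur is not None:
        if cur.name == ancestor_name:
            return True
        cur = cur.parent
    return False


def verify_channel_access(op: Op, channel_type: str) -> Optional[str]:
    """Return a diagnostic string (wording is hypothetical) or None if valid."""
    if channel_type == "gpu_symmetric_heap" and not has_enclosing(op, "air.rank"):
        return f"'{op.name}' op on a gpu_symmetric_heap channel must be inside air.rank"
    return None
```

In this model the NPU channel types are never rejected by the scope rule, which matches the description above: only gpu_symmetric_heap put/get sites are rank-scoped.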

air.dma_memcpy_nd cross-rank addressing

  • Optional src_rank / dst_rank integer attributes name a peer rank in the enclosing air.rank scope.
  • Verifier requires:
    • an enclosing air.rank scope
    • the peer-side memref's memref.alloc (when directly available) to carry the air.symmetric attribute
  • Backward-compatible custom builder so existing callers compile unchanged.
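
The two verifier requirements can be summarised in a small Python model. This is a sketch under stated assumptions, not the implementation in this PR: the `Memref` and `DmaMemcpyNd` classes, the error strings, and the reading of "peer-side" as "the operand whose rank attribute is set" are all illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memref:
    # True/False: the defining memref.alloc is visible and does / does not
    # carry air.symmetric. None: the alloc is not directly available
    # (for example, the memref is a block argument), so no check is made.
    alloc_is_symmetric: Optional[bool] = None


@dataclass
class DmaMemcpyNd:
    src: Memref
    dst: Memref
    src_rank: Optional[int] = None
    dst_rank: Optional[int] = None
    inside_air_rank: bool = False


def verify_cross_rank(op: DmaMemcpyNd) -> List[str]:
    """Return a list of hypothetical diagnostics; empty means the op is valid."""
    errors: List[str] = []
    if op.src_rank is None and op.dst_rank is None:
        return errors  # plain DMA: existing callers are unaffected
    if not op.inside_air_rank:
        errors.append("src_rank/dst_rank require an enclosing air.rank scope")
    for side, rank, memref in (
        ("src", op.src_rank, op.src),
        ("dst", op.dst_rank, op.dst),
    ):
        # Only the side that names a peer rank is checked for air.symmetric.
        if rank is not None and memref.alloc_is_symmetric is False:
            errors.append(f"{side} memref.alloc must carry air.symmetric")
    return errors
```

A DMA with neither attribute set returns no errors in this model, mirroring the backward-compatibility goal for existing callers.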

air.symmetric memref attribute

A unit attribute on memref.alloc indicating the allocation should be backed by the symmetric heap. Documented in docs/AIRComputeModel.md §2.7.

Documentation

docs/AIRComputeModel.md updated to describe the new IR surface:

  • §2.4 cross-rank addressing on air.dma_memcpy_nd
  • §2.5 channel_type table including the npu_* rename and gpu_symmetric_heap
  • §2.7 air.symmetric memref attribute
  • §5 summary table updated for cross-rank / multi-GPU concepts

Test plan

  • All 21 mlir/test/Dialect/AIR/ tests pass (positive round-trip + verifier negatives)
  • New air_cross_rank_dma.mlir: round-trip for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel inside air.rank
  • air_channel_invalid.mlir: gpu_symmetric_heap put/get outside air.rank rejected
  • air_memcpy_invalid.mlir: src_rank/dst_rank outside air.rank rejected, missing air.symmetric on alloc rejected
  • CI clang-format / black format
  • AIE backend regression (covered by CI)
  • GPU end-to-end (Phases 2–7, from "[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU" #1577 through "[multi-gpu] Phase 7: aircc integration (--multi-gpu flag)" #1582, build on this PR; all 5 INPUT variants pass at W=2/4/8 on rad-mi325x-1)

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase1-channel-types-and-cross-rank branch from abbc586 to 38b7e10 Compare May 3, 2026 20:21
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
@erwei-xilinx erwei-xilinx marked this pull request as ready for review May 6, 2026 01:04
Copilot AI review requested due to automatic review settings May 6, 2026 01:04
Copilot AI left a comment

Pull request overview

This PR implements Phase 1 of multi-GPU messaging support by extending the AIR IR surface: it namespaces existing NPU channel types, adds a GPU-specific symmetric-heap channel type, and introduces cross-rank addressing attributes plus an air.symmetric allocation marker, along with corresponding verifier rules, tests, and documentation updates.

Changes:

  • Renames existing channel_type values to npu_* to make backend scope explicit.
  • Adds gpu_symmetric_heap channel type (rank-scoped) and cross-rank src_rank/dst_rank attributes on air.dma_memcpy_nd gated by air.rank + air.symmetric.
  • Updates verifier logic, MLIR tests, examples, and compute model documentation to cover the new IR surface.

Reviewed changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py | Updates cascade channel example to use npu_cascade. |
| programming_examples/herd_dataflow/run.py | Updates default and cascade channel_type strings to npu_*. |
| programming_examples/herd_dataflow/air.mlir | Updates channel declarations/comments to npu_* naming. |
| programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py | Renames cascade channels to npu_cascade. |
| programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py | Renames cascade channels to npu_cascade. |
| programming_examples/flash_attention/dataflow_based/attn.py | Renames cascade channel attribute to npu_cascade. |
| programming_examples/channel_examples/mmio/mmio.py | Renames mmio channel type to npu_mmio and updates docstring. |
| programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py | Updates comment to refer to npu_dma_packet. |
| programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py | Renames cascade channel to npu_cascade and reformats call. |
| programming_examples/cascade_reduction/cascade_reduction.py | Renames cascade channel to npu_cascade. |
| mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir | Updates FileCheck expectations to npu_dma_packet. |
| mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir | Updates cascade channel types to npu_cascade. |
| mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir | Updates cascade channel declarations to npu_cascade. |
| mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir | Updates negative checks to npu_dma_packet. |
| mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir | Updates expected upgraded channel types to npu_dma_packet. |
| mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir | Updates expected upgraded channel types to npu_dma_packet. |
| mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir | Updates broadcast upgrade expectations to npu_dma_packet. |
| mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir | Updates stream/packet channel types to npu_* for non-fusion test. |
| mlir/test/Dialect/AIR/air_memcpy_invalid.mlir | Adds verifier-negative tests for cross-rank src_rank/dst_rank and missing air.symmetric. |
| mlir/test/Dialect/AIR/air_cross_rank_dma.mlir | New round-trip tests for cross-rank DMA attrs, air.symmetric, and gpu_symmetric_heap. |
| mlir/test/Dialect/AIR/air_channel.mlir | Updates channel type round-trips and adds gpu_symmetric_heap parse/print coverage. |
| mlir/test/Dialect/AIR/air_channel_invalid.mlir | Updates allowlist diagnostic and adds verifier negatives for gpu_symmetric_heap outside air.rank. |
| mlir/test/Dialect/AIR/air_canonicalize.mlir | Updates cascade channel type to npu_cascade in canonicalization test. |
| mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir | Updates cascade channel check to npu_cascade. |
| mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir | Updates packet channels to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir | Updates packet channel types to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir | Updates packet channel declarations to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir | Updates intra-device packet channels to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir | Updates packet channel to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir | Updates packet channel to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir | Updates multiple packet channel types to npu_dma_packet. |
| mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir | Updates cascade channel declarations to npu_cascade. |
| mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir | Updates mmio tests to npu_mmio and stream default to npu_dma_stream. |
| mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir | Updates mmio-negative diagnostics and channels to npu_mmio. |
| mlir/lib/Util/Util.cpp | Changes default inferred channel type to npu_dma_stream. |
| mlir/lib/Transform/AIRMiscPasses.cpp | Updates cascade detection to npu_cascade. |
| mlir/lib/Transform/AIRLinalgCodegen.cpp | Updates generated channel default to npu_dma_stream. |
| mlir/lib/Transform/AIRHerdPlacementPass.cpp | Updates cascade channel collection to npu_cascade. |
| mlir/lib/Transform/AIRDmaToChannel.cpp | Updates created/upgraded channel types to npu_* and mmio exclusion to npu_mmio. |
| mlir/lib/Dialect/AIR/IR/AIRDialect.cpp | Adds cross-rank DMA verification, enforces rank-scope for gpu_symmetric_heap put/get, and updates channel_type allowlist to namespaced values. |
| mlir/lib/Conversion/ConvertToAIRPass.cpp | Updates cascade channel creation to tag npu_cascade. |
| mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp | Updates internal resource type strings to npu_* and mmio handling to npu_mmio. |
| mlir/lib/Conversion/AIRToAIEPass.cpp | Updates mmio gating and resource-type branching to npu_* names. |
| mlir/include/air/Dialect/AIR/AIR.td | Adds src_rank/dst_rank attrs, changes default channel_type to npu_dma_stream, and documents gpu_symmetric_heap. |
| docs/AIRComputeModel.md | Documents cross-rank DMA attrs, namespaced channel types, and air.symmetric attribute; updates summary tables accordingly. |

Comment thread mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp Outdated
Comment thread docs/AIRComputeModel.md Outdated
Comment thread docs/AIRComputeModel.md Outdated
Comment thread mlir/include/air/Dialect/AIR/AIR.td
Comment thread mlir/include/air/Dialect/AIR/AIR.td
Comment thread mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
erwei-xilinx and others added 5 commits May 6, 2026 04:47
…c plan

Step toward multi-GPU messaging support per docs/MultiGPUPlan.md. Pure IR/dialect
changes — no lowering yet.

## channel_type namespace rename (Option 1)

Existing channel_type values gain a `npu_` prefix to make backend scope explicit:
- `dma_stream` → `npu_dma_stream` (default)
- `dma_packet` → `npu_dma_packet`
- `cascade`    → `npu_cascade`
- `mmio`       → `npu_mmio`

Mechanical rename across 33 files (verifier, transform/conversion passes, all
.mlir tests, Python programming examples).

## New channel_type for GPU multi-rank messaging

- `gpu_symmetric_heap`: cross-rank channel through the symmetric heap runtime
  (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires put/get sites
  to be inside an `air.rank` scope.

## air.dma_memcpy_nd cross-rank addressing

- New optional integer attributes `src_rank` / `dst_rank` name a peer rank in
  the enclosing `air.rank` scope.
- Verifier requires:
  - an enclosing `air.rank` scope
  - the peer-side memref's `memref.alloc` (when directly available) to carry
    the `air.symmetric` attribute
- Backward-compatible builder so existing call sites compile unchanged.

## air.symmetric memref attribute

A unit attribute on `memref.alloc` indicating the allocation is backed by the
symmetric heap. Documented in docs/AIRComputeModel.md §2.7.

## Documentation

- New docs/MultiGPUPlan.md: full design and 7-phase implementation plan
- docs/AIRComputeModel.md: §2.4 cross-rank addressing, §2.7 air.symmetric,
  §2.5 channel_type table updated, §5 summary table updated

## Tests

- mlir/test/Dialect/AIR/air_cross_rank_dma.mlir (new): positive round-trip
  for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel
  put/get inside air.rank
- mlir/test/Dialect/AIR/air_channel_invalid.mlir: gpu_symmetric_heap
  put/get outside air.rank rejected; updated unsupported channel_type
  error message
- mlir/test/Dialect/AIR/air_memcpy_invalid.mlir: src_rank/dst_rank
  outside air.rank rejected; missing air.symmetric on alloc rejected

All 21 mlir/test/Dialect/AIR/ tests pass; GPU dma_copy and 4k_4k_mul e2e
tests pass on MI300A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply clang-format-17 reflow to three .cpp files (text-string wrapping
across the renamed channel_type values "npu_mmio" / "npu_cascade" /
"npu_dma_stream") and black reformat to one .py file (npu_cascade arg
list now exceeds the line limit).

These were reported by the lintAndFormat workflow on PR Xilinx#1576; this
commit folds them into Phase 1 so the diff CI saw is what's now in tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six Copilot comments on PR Xilinx#1576:

1. AIRToAIESchedulingUtils.cpp: four diagnostic strings still said
   "dma_stream / dma_packet" after the rename to "npu_dma_stream /
   npu_dma_packet". Updated.

2. docs/AIRComputeModel.md (cross-rank DMA, §2.4): said the GPU
   backend lowers src_rank/dst_rank, contradicting the summary table
   that calls it "planned". Reworded as "planned: air-cross-rank-dma-
   to-mgpu" to match.

3. docs/AIRComputeModel.md (air.symmetric, §2.7): same inconsistency
   for mgpuSymmetricAlloc routing. Reworded as "planned:
   air-symmetric-alloc-to-mgpu".

4. AIR.td (DmaMemcpyNdOp description): same inconsistency. Reworded.

5. AIR.td (gpu_symmetric_heap channel_type description): claimed
   "Lowered by air-to-rocdl to thread-cooperative loops..." with no
   such lowering yet in tree. Reworded as "planned:
   air-gpu-channel-to-mgpu".

6. AIRDialect.cpp DmaMemcpyNdOp::verify: rank indices are
   non-negative. Added explicit `>= 0` check, plus matching verifier-
   negative tests in air_memcpy_invalid.mlir for both src_rank=-1 and
   dst_rank=-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit (888bcaa) added a `>= 0` verifier on src_rank /
dst_rank, but used `getSrcRank()` / `getDstRank()` — those return
`std::optional<uint64_t>` (a TableGen quirk for `OptionalAttr<I64Attr>`),
so `*sr < 0` on the unsigned value is always false and the check never
fired. The two new verifier-negative tests in air_memcpy_invalid.mlir
silently regressed.

Switch to the typed `getSrcRankAttr()` / `getDstRankAttr()` accessors
which return `IntegerAttr`, then call `.getInt()` for a real `int64_t`.
The check now fires on negative values; both negative-rank tests pass
under `lit -sv ../../mlir/test/Dialect/AIR` (21/21).
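
The failure mode described above, a `< 0` test on a value that has already been converted to unsigned, can be reproduced outside C++. The sketch below uses Python's `ctypes` fixed-width integers to stand in for `uint64_t` and `int64_t`; the two function names are hypothetical and only model the accessor behaviour this commit message describes.

```python
import ctypes


def rank_is_negative_buggy(rank: int) -> bool:
    """Models the check written against the uint64_t-returning accessor."""
    # ctypes does no overflow checking, so -1 wraps to 2**64 - 1 here,
    # just as a negative int64 reinterpreted as uint64_t would.
    as_unsigned = ctypes.c_uint64(rank).value
    return as_unsigned < 0  # an unsigned value is never below zero


def rank_is_negative_fixed(rank: int) -> bool:
    """Models the check written against a signed int64_t value."""
    as_signed = ctypes.c_int64(rank).value
    return as_signed < 0
```

With the rank values used by the negative tests (-1 and -3), the buggy variant reports no error while the fixed variant does, which is why the tests only started passing after the accessor change.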

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
origin/main grew 5 new herd-placement tests via Xilinx#1583 that use the
pre-rename `channel_type = "cascade"`. After this PR's namespace rename
("cascade" -> "npu_cascade"), those tests fail under air-opt with the
verifier rejecting the old name. Update them to "npu_cascade" so they
keep passing on top of phase 1.

Verified on rad-mi300a-sh5-1: AIRHerdPlacement 15/15 pass, Dialect/AIR
21/21 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase1-channel-types-and-cross-rank branch from 90c90d6 to 965f853 Compare May 6, 2026 04:48
CI on 'Build and Test with AIE tools on Ryzen AI (amdhx370)' caught one
more stale "cascade" reference: test/xrt/34_cascade_vecadd/run_peano.py
embeds an inline MLIR string that declared `channel_type = "cascade"`.
Update to "npu_cascade" to match the namespace rename. The corresponding
run_chess.py variant didn't have this issue.

Verifier diagnostic from the failing job:
  'air.channel' op unsupported channel_type "cascade"; expected one of
  "npu_dma_stream", "npu_dma_packet", "npu_cascade", "npu_mmio", or
  "gpu_symmetric_heap"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
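
The diagnostic quoted in this commit message comes from an allowlist check on channel_type. A minimal Python sketch of such a check follows, assuming the five supported values listed in the diagnostic; `check_channel_type` is a hypothetical name, and the real verifier is C++ in AIRDialect.cpp.

```python
from typing import Optional

# Supported values, in the order the diagnostic lists them.
SUPPORTED_CHANNEL_TYPES = (
    "npu_dma_stream",
    "npu_dma_packet",
    "npu_cascade",
    "npu_mmio",
    "gpu_symmetric_heap",
)


def check_channel_type(value: str) -> Optional[str]:
    """Return None for a supported value, else a diagnostic string."""
    if value in SUPPORTED_CHANNEL_TYPES:
        return None
    quoted = [f'"{v}"' for v in SUPPORTED_CHANNEL_TYPES]
    expected = ", ".join(quoted[:-1]) + ", or " + quoted[-1]
    return (
        f"'air.channel' op unsupported channel_type \"{value}\"; "
        f"expected one of {expected}"
    )
```

Any pre-rename value, such as "cascade" in an inline MLIR string, falls through to the diagnostic branch, which is how the stale reference in run_peano.py surfaced in CI.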
@erwei-xilinx erwei-xilinx added this pull request to the merge queue May 6, 2026
Merged via the queue into Xilinx:main with commit fd62d7c May 6, 2026
27 checks passed
@erwei-xilinx erwei-xilinx deleted the multigpu-phase1-channel-types-and-cross-rank branch May 6, 2026 16:18